Getting set up

Start an RStudio session at OSC

Show instructions

  • Fill out the form as shown here.

  • Now, you should see a box like this:

  • Your job should start running pretty soon, and when it’s ready the box should look like this:

  • Click Connect to RStudio Server at the bottom of the box, and an RStudio Server instance will open. You’re ready to go!

Create an RStudio Project

Show instructions
basedir <- "/fs/project/PAS0471/teach/misc/2021-02_rnaseq/"

# Get your user name (by running a shell command via the `system()` function:
me <- system("echo $USER", intern = TRUE)

# Build the path to the target dir:
proj_dir <- file.path(basedir, me)

# Create the Project:
library(usethis)
create_project(path = proj_dir)

Now, RStudio will reload with the newly created Project open.

If you get the pop-up below, click Don't Save (do this whenever you get that pop-up):

Copy the count data and the metadata files

Go the Terminal tab next to the R Console in the bottom left of the RStudio window:

cd /fs/project/PAS0471/teach/misc/2021-02_rnaseq/$USER

cp ../master/data/meta/metadata.txt data/meta
cp ../master/results/count/gene_counts_all.txt results/count

Load the necessary packages

if(! "pacman" %in% installed.packages()) install.packages("pacman")

packages <- c("DESeq2",          # Differential expression analysis
              "tidyverse",       # Misc. data manipulation and plotting
              "here",            # Managing file paths
              "pheatmap",        # Heatmap plot
              "apeglm")          # For LFC shrinkage
pacman::p_load(char = packages)

theme_set(theme_bw())  # Set ggplot theme

Input and output dirs and files

Input files:

  • Gene counts table – the file exactly as it was written by featureCounts.

  • Metadata table – so we can group our samples and make comparisons between these groups.

count_table_file <- here("results/count/gene_counts_all.txt")
metadata_file <- here("data/meta/metadata.txt")

Output directories – and we create them if they don’t already exist:

outdir <- here("results/DE/")
plotdir <- here("results/DE/fig/")

if (!dir.exists(plotdir)) dir.create(plotdir, recursive = TRUE)
if (!dir.exists(plotdir)) dir.create(plotdir, recursive = TRUE)

Load input data

Load the count table from featureCounts:

raw_counts <- read.table(count_table_file,
                         sep = "\t", header = TRUE, skip = 1)

Load the metadata information:

metadata <- read.table(metadata_file, header = TRUE)

head(metadata)
##   SampleID Experiment GH_trial AMF           Treatment
## 1  C6_myb2       CAPS   CAPS_6  Ri Agrobacterium_noexp
## 2  C6_myb3       CAPS   CAPS_6  Ri Agrobacterium_noexp
## 3  C6_myb4       CAPS   CAPS_6  Ri Agrobacterium_noexp
## 4    CT_05       CAPS   CAPS_6  Ri                mock
## 5    CT_06       CAPS   CAPS_6  Ri                mock
## 6    CT_09       CAPS   CAPS_6  Ri                mock

The Treatment column currently has the values Agrobacterium_noexp, Agrobacterium_myb, and mock.

To shorten this a bit, we’ll get rid of “Agrobacterium_”:

metadata$Treatment <- sub("Agrobacterium_", "", metadata$Treatment)

unique(metadata$Treatment)
## [1] "noexp" "mock"  "myb"



Prepare the data

Change the column names, which are very long:

colnames(raw_counts)[7:8]
## [1] "X.fs.scratch.PAS1548.JamesSeq.CAPS.tomato.star_out.X6465_Benitez.PonceM_C6_myb2_V1N_1_S1_L001_R1_001.fastq.gzAligned.sortedByCoord.out.bam"
## [2] "X.fs.scratch.PAS1548.JamesSeq.CAPS.tomato.star_out.X6465_Benitez.PonceM_C6_myb2_V1N_1_S1_L002_R1_001.fastq.gzAligned.sortedByCoord.out.bam"
my_regex <- ".+PonceM_(.+)_V1N.+"
colnames(raw_counts) <- sub(my_regex, "\\1", colnames(raw_counts))

colnames(raw_counts)
##  [1] "Geneid"  "Chr"     "Start"   "End"     "Strand"  "Length"  "C6_myb2"
##  [8] "C6_myb2" "C6_myb3" "C6_myb3" "C6_myb4" "C6_myb4" "CT_05"   "CT_05"  
## [15] "CT_06"   "CT_06"   "CT_09"   "CT_09"   "T_26"    "T_26"    "T_29"   
## [22] "T_29"    "T_30"    "T_30"    "cnt_1"   "cnt_1"   "cnt_2"   "cnt_2"  
## [29] "cnt_3"   "cnt_3"   "myb_1"   "myb_1"   "myb_2"   "myb_2"   "myb_3"  
## [36] "myb_3"

Besides the counts, there are columns with metadata for each gene:

raw_counts[1:5, 1:8]
##             Geneid       Chr  Start    End Strand Length C6_myb2 C6_myb2.1
## 1 Solyc00g160260.1 SL4.0ch00  83863  84177      +    315       0         0
## 2 Solyc00g160270.1 SL4.0ch00 166754 167268      -    515       0         0
## 3 Solyc00g500003.1 SL4.0ch00 311496 382066      -  70571       1         4
## 4 Solyc00g500004.1 SL4.0ch00 417592 418482      +    891       0         0
## 5 Solyc00g500005.1 SL4.0ch00 478389 478640      +    252       0         0

Let’s remove those:

counts <- raw_counts[, 7:ncol(raw_counts)]
rownames(counts) <- raw_counts$Geneid

Replicate samples

In this table, there are two separate entries for each sample: each library was sequenced on two lanes. Recall that in our workflow, we had merged these FASTQ files prior to mapping, but here we are using the table based on the full dataset produced by Matthew.

So, we will go ahead and merge these replicates now, by simply summing the counts:

counts.t <- t(counts)
rownames(counts.t) <- names(raw_counts)[7:36]
sums <- rowsum(counts.t, group = rownames(counts.t))
counts <- t(sums)

(Alternatively, one could use the specialized function DESeq2::collapseReplicates() for this.)

Check sample IDs

For differential expression analysis, we will be using the popularDESeq2 R/Bioconductor package (paper, website).

We will load both the count table and the metadata into DESeq2. When doing so, DESeq2 assumes that sample IDs in both tables match and are provided in the same order. Let’s make sure this is indeed the case.

Sort both data frames alphabetically:

metadata <- metadata[order(metadata$SampleID), ]
counts <- counts[, order(colnames(counts))]

Check if names are the same:

metadata$SampleID
##  [1] "C6_myb2" "C6_myb3" "C6_myb4" "cnt_1"   "cnt_2"   "cnt_3"   "CT_05"  
##  [8] "CT_06"   "CT_09"   "myb_1"   "myb_2"   "myb_3"   "T_26"    "T_29"   
## [15] "T_30"
colnames(counts)
##  [1] "C6_myb2" "C6_myb3" "C6_myb4" "cnt_1"   "cnt_2"   "cnt_3"   "CT_05"  
##  [8] "CT_06"   "CT_09"   "myb_1"   "myb_2"   "myb_3"   "T_26"    "T_29"   
## [15] "T_30"
matching_names <- identical(metadata$SampleID, colnames(counts))
matching_names
## [1] TRUE
if(matching_names == FALSE) stop("Sample ID in metadata and count matrix do not match!")

Create the DESeq2 object

We will create the DESeq2 object using the function DESeqDataSetFromMatrix(), which we will provide with three pieces of information:

  • The count data with argument countData.
  • The metadata with argument colData.
  • The model design for the DE analysis – argument design.
    For now, we will specify ~1, which means “no design” – we will change this before the actual DE analysis.
dds_raw <- DESeqDataSetFromMatrix(countData = counts,
                                  colData = metadata,
                                  design = ~ 1)



Explore the count data

What are number of rows and columns of the count matrix?

dim(counts)
## [1] 34688    15

How many genes have non-zero counts?

dim(counts[rowSums(counts) > 0, ])
## [1] 28145    15

How many genes have total counts of at least 10?

dim(counts[rowSums(counts) >= 10, ])
## [1] 24771    15

Histogram of gene counts

Let’s plot a histogram of gene counts:

theme_set(theme_bw())

summed_gene_counts <- data.frame(gene_count = rowSums(counts)) %>%
  rownames_to_column("gene_id")

ggplot(data = summed_gene_counts) +
  geom_histogram(aes(x = gene_count), binwidth = 10000) +
  scale_y_log10(expand = c(0, 0)) +
  scale_x_continuous(expand = c(0,0))

Zoom in a bit:

ggplot(data = summed_gene_counts) +
  geom_histogram(aes(x = gene_count), binwidth = 1000) +
  scale_y_log10(expand = c(0, 0)) +
  scale_x_continuous(limits = c(0, 200000), expand = c(0,0)) +
  theme(plot.margin = margin(0.5, 0.7, 0.5, 0.5, "cm"))

How are counts distributed across samples? That is, we would like a sum of counts for each column. To get this, we use the apply() function, which can apply a function (in our case sum()) to all columns (hence MARGIN = 2 – for rows, use 1) of our counts dataframe:

apply(X = counts, MARGIN = 2, FUN = sum)
##  C6_myb2  C6_myb3  C6_myb4    cnt_1    cnt_2    cnt_3    CT_05    CT_06 
## 18122406 21913320 13659702 18844494 18890045 16634169 22183005 20620412 
##    CT_09    myb_1    myb_2    myb_3     T_26     T_29     T_30 
## 22924853 22070790 22829207 23195471 19902716 23221528 15793928



Principal Component Analysis (PCA)

Run the PCA and prepare for plotting

First, we normalize the count data to have even sampling across samples (with respect to library size) and approximately even variance:

vsd <- varianceStabilizingTransformation(dds_raw, blind = TRUE)

Next, we run the PCA and retrieve the data to plot with ggplot2:

pcaData <- plotPCA(vsd,
                   ntop = 500,
                   intgroup = c("AMF", "Treatment"),
                   returnData = TRUE)

We extract the percentage of variance explained by different principal components, so we can later add this information to the plot:

percentVar <- round(100 * attr(pcaData, "percentVar"))
percentVar
## [1] 59 15

We create a plot title with the species name in italic using the somewhat bizarre expression() function:

plot_title <- expression("PCA of " * italic(Glycine ~ max) * " transcript count data")

Plot the PCA results

ggplot(pcaData,
       aes(x = PC1, y = PC2, color = AMF, shape = Treatment)) +
  geom_point(size = 6) +
  xlab(paste0("PC1: ", percentVar[1], "% of variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% of variance")) +
  ggtitle(plot_title)

Plot again – with sample names

library(ggrepel)

pca_plot2 <- ggplot(pcaData,
              aes(PC1, PC2, color = AMF, shape = Treatment)) +
  geom_point(size = 3) +
  geom_label_repel(aes(label = name)) +
  xlab(paste0("PC1: ", percentVar[1], "% of variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% of variance")) +
  ggtitle(plot_title)

print(pca_plot2)



DE analysis – full dataset

The design has two factors: AMF and Treatment. Rather than fit a multivariate model, we can start by merging the two into a single factor called group, and fit a univariate model with this factor.

dds_raw$group <- factor(paste(dds_raw$AMF, dds_raw$Treatment, sep = "_"))
table(dds_raw$group)
## 
## control_mock  control_myb      Ri_mock       Ri_myb     Ri_noexp 
##            3            3            3            3            3

We will set the “reference” level of the factor to be the double negative control (empty substrate, no Agrobacteria):

dds_raw$group <- relevel(dds_raw$group, ref = "control_mock")
dds_raw$group
##  [1] Ri_noexp     Ri_noexp     Ri_noexp     control_mock control_mock
##  [6] control_mock Ri_mock      Ri_mock      Ri_mock      control_myb 
## [11] control_myb  control_myb  Ri_myb       Ri_myb       Ri_myb      
## Levels: control_mock control_myb Ri_mock Ri_myb Ri_noexp

Next, we set the analysis design:

design(dds_raw) <- formula(~ group)

And finally, we perform the differential expression analysis with the DEseq() function:

dds <- DESeq(dds_raw)

The DESeq() function above performs three steps consecutively:

  • estimateSizeFactors() – “Normalization” by library size. Note that DESeq2 doesn’t actually normalize the counts, it uses raw counts and includes the size factors in the modeling.

  • estimateDispersions() – Estimate gene-wise dispersions.

  • nbinomWaldTest(ddsObj) – Fit the negative Binomial GLM and calculate Wald statistics.

These functions could also be called separately, which would be useful if you want to be able to change more defaults.

The results table

res <- results(dds)
res
## log2 fold change (MLE): group Ri noexp vs control mock 
## Wald test p-value: group Ri noexp vs control mock 
## DataFrame with 34688 rows and 6 columns
##                   baseMean log2FoldChange     lfcSE      stat    pvalue
##                  <numeric>      <numeric> <numeric> <numeric> <numeric>
## Solyc00g160260.1   0.00000             NA        NA        NA        NA
## Solyc00g160270.1   0.00000             NA        NA        NA        NA
## Solyc00g500003.1   9.39133       -2.20718   1.24311  -1.77554 0.0758095
## Solyc00g500004.1   0.00000             NA        NA        NA        NA
## Solyc00g500005.1   0.00000             NA        NA        NA        NA
## ...                    ...            ...       ...       ...       ...
## Solyc12g100330.2   1140.38       0.701332  0.274283  2.556962 0.0105591
## Solyc12g100340.1      0.00             NA        NA        NA        NA
## Solyc12g100360.1   1552.93      -0.285219  0.306501 -0.930565 0.3520788
## Solyc12g100363.1      0.00             NA        NA        NA        NA
## Solyc12g100367.1      0.00             NA        NA        NA        NA
##                       padj
##                  <numeric>
## Solyc00g160260.1        NA
## Solyc00g160270.1        NA
## Solyc00g500003.1  0.241664
## Solyc00g500004.1        NA
## Solyc00g500005.1        NA
## ...                    ...
## Solyc12g100330.2  0.067777
## Solyc12g100340.1        NA
## Solyc12g100360.1  0.593960
## Solyc12g100363.1        NA
## Solyc12g100367.1        NA

By default, the results table prints statistics comparing the last level of the factor with the first level: that is, log-fold change and p-values describe differences between these two levels specifically. However, we can easily extract equivalent statistics for any pairwise comparison among our factor levels, which we will see later.

For now, we will explore what each column in this table means:

  • The baseMean column contains the mean expression level across all samples.

  • The log2FoldChange column contains the log2-fold change of gene counts between the compared levels, that is, it represents the effect size. A log2-fold change of 1 indicates that the expression in the reference level is 2 times lower than that of the other level, a log2-fold change of 2 indicates a four-fold difference, a log2-fold change of 3 indicates an eight-fold difference, and so on. Similarly, negative log2-fold change values indicate an change in gene counts in the other direction.

  • The lfcSE column indicates the uncertainty in terms of the standard error (SE) of the log2-fold change estimate.

  • The stat column indicates the value for the Wald test’s test statistic.

  • The pvalue column reported the uncorrected p-value from the Wald test.

  • Because we are testing significance for many genes, we need to correct for multiple testing. DESeq2 uses the Benjamini-Hochberg False Discovery Rate (FDR) correction, and these values are reported in the column padj (i.e., adjusted p-value).

A summary of this information about each column can be seen by running the mcols() function:

mcols(res)
## DataFrame with 6 rows and 2 columns
##                        type            description
##                 <character>            <character>
## baseMean       intermediate mean of normalized c..
## log2FoldChange      results log2 fold change (ML..
## lfcSE               results standard error: grou..
## stat                results Wald statistic: grou..
## pvalue              results Wald test p-value: g..
## padj                results   BH adjusted p-values

NA values in the results table

Some values in the results table can be set to NA for one of the following reasons:

  • If within a row, all samples have zero counts, the baseMean column will be zero, and the log2-fold change estimates, p-value and adjusted p-value will all be set to NA.

  • DESeq2 performs outlier detection using Cook’s distance. If a gene contains a sample with an count outlier, both the p-value and adjusted p-value will be set to NA.

  • DESeq2 also automatically filters low mean count genes in the sense that it does not include them in the multiple testing correction. Therefore, in such cases, the p-value will not be NA, but the adjusted p-value will be.

    Because we have very low power to detect differential expression for such low-count genes, it is beneficial to remove them prior to the multiple testing correction, so this correction becomes less severe for the remaining genes.

Let’s see how many genes have NA p-values:

# Number of genes with NA p-value:
sum(is.na(res$pvalue))
## [1] 6601
# As a proportion of the total number of genes in the test:
sum(is.na(res$pvalue)) / nrow(res)
## [1] 0.1902964

And NA adjusted p-values:

# Number of genes with NA p-value:
sum(is.na(res$padj))
## [1] 11991
# As a proportion of the total number of genes in the test:
sum(is.na(res$padj)) / nrow(res)
## [1] 0.3456815



DE analysis – contrast two custom groups

Although not displayed in a particularly readable ways, we can see which pairwise contrasts between different levels of the factor are available:

resultsNames(dds)
## [1] "Intercept"                         "group_control_myb_vs_control_mock"
## [3] "group_Ri_mock_vs_control_mock"     "group_Ri_myb_vs_control_mock"     
## [5] "group_Ri_noexp_vs_control_mock"

Not all pairwise contrasts between the 5 levels in our group factor are available: instead, control_mock, which we set as the reference level, is being compared with all 4 other levels.

Above, we looked at the results (p-values and so on) for the last of these comparisons, which is what DESeq2 will show by default. To see the results table for one of the other 3 comparisons, we pass a vector to the contrast argument of the results() function with the factor (group) and the two levels to be contrasted:

my_contrast <- c("Ri_mock", "control_mock")

res <- results(dds,
               contrast = c("group", my_contrast))

How many adjusted p-values were less than 0.1?

sum(res$padj < 0.1, na.rm = TRUE)
## [1] 9014

We’ll also create an object with adjusted (shrunken) LFC estimates which will be useful for visualization and ranking of genes by LFC.

my_coef <- paste0("group_", paste0(my_contrast, collapse = "_vs_"))
my_coef
## [1] "group_Ri_mock_vs_control_mock"
res_LFC <- lfcShrink(dds, coef = my_coef, type = "apeglm", lfcThreshold = 1)

Note that here, we had to provide the contrast (“coefficient”) in the format given by resultsNames(dds), which looks confusing because it uses underscores as a separator, and our factor levels also contain underscores.



Exploring the results

MA-plot:

For a nice overview of the results, we can plot a so-called “MA plot”, which plots, for each gene, count differences (as LFC) between two groups on the y-axis, and mean counts across both groups on the x-axis.

We can create an MA plot using DESeq2’s plotMA function, with significantly differentially expressed genes (using alpha = 0.1) in blue:

plotMA(res, ylim = c(-5, 5))

To be able to customize the plot, we’ll use returnData = TRUE like we have done with previous plots, and then plot the resulting dataframe with ggplot2:

d <- plotMA(res, returnData = TRUE)

ggplot(d, aes(x = mean, y = lfc, color = isDE)) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_y_continuous(limits = c(-10, 10)) +
  scale_color_manual(values = c('grey50', 'blue')) +
  guides(color = FALSE) +
  labs(x = "Mean of normalized counts",
       y = "LFC")

We can see that lowly-expressed genes tend to deviate from an LFC of 0 (same mean expression levels in the two groups) much more than 0. However, this is an artifact of noise overpowering signal with such low expression values. We can also see that genes are never significantly expressed in the same far left part of the plot due to this lack of power.

DESeq2 provides several methods to adjust LFC estimates for the low-expression bias, one of which we used above to produce the res_LFC object. Let’s create another MA plot for these adjusted estimates:

d <- plotMA(res_LFC, ylim = c(-5, 5), returnData = TRUE)

ggplot(d, aes(x = mean, y = lfc, color = isDE)) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_color_manual(values = c('grey50', 'blue')) +
  guides(color = FALSE) +
  labs(x = "Mean of normalized counts",
       y = "Shrunken LFC")

We can also make this plot interactive with Plotly, so we can see the identity of each gene by hovering over the point:

library(plotly)

d$gene <- rownames(d)

p_ma <- ggplot(d,
               aes(x = mean, y = lfc, color = isDE, text = gene)) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_color_manual(values = c('grey50', 'blue')) +
  guides(color = FALSE) +
  labs(x = "Mean of normalized counts",
       y = "Shrunken LFC")

ggplotly(p_ma, tooltip = "text")

Plot specific genes

We can also create plot of expression levels in individual genes, which will be especially interesting for genes with highly significant differential expression.

Let’s plot the top-5 most significantly differentially expressed genes:

top5 <- row.names(res[order(res$padj), ][1:5, ])

for(fgene in top5) {
  
  d <- plotCounts(dds,
                  gene = fgene,
                  intgroup = "group",
                  returnData = TRUE)

  p <- ggplot(d, aes(x = group, y = count)) +
          geom_point(position = position_jitter(w = 0.1, h = 0)) +
          labs(title = fgene) +
          theme_bw()
  
  print(p)
}

Heatmap

We can create heatmaps with the pheatmap function – for instance, for the 20 most highly expressed genes:

ntd <- normTransform(dds)
selection <- order(rowMeans(counts(dds, normalized = TRUE)),
                   decreasing = TRUE)[1:20]
ntd_sel <- assay(ntd)[selection, ]
df_meta <- as.data.frame(colData(dds)[, c("AMF", "Treatment")])
pheatmap(ntd_sel,
         cluster_rows = FALSE,
         cluster_cols = FALSE,
         show_rownames = FALSE,
         annotation_col = df_meta)

Export the results

TBA



DE analysis – with two factors

TBA

---
title: "Differential expression (DE) analysis <br> with DESeq2"
author: "Jelmer Poelstra"
date: "3/26/2021"
output:
  html_document:
    code_download: true
    toc_depth: 3
    toc: true
    toc_float: true
    theme: cerulean
    highlight: tango
    anchor_sections: true
    css: html_page.css
editor_options: 
  chunk_output_type: inline
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo=TRUE, eval=TRUE, cache=TRUE,
  warning=FALSE, message=FALSE,
  class.source = "r_code",
  class.output = "r_output",
  class.warning = "r_warning",
  class.message = "r_warning",
  class.error = "r_error"
  )
```

<br> <br>

----

## Getting set up

### Start an RStudio session at OSC

<details><summary>Show instructions</summary>

- Login to OSC at <https://ondemand.osc.edu>.

- Click on `Interactive Apps` (top bar) > `RStudio Server (Owens and Pitzer)`

<p align="center">
<img src=img/apps.png width="500">
</p>

- Fill out the form as shown [here](slides/03-OSC-slides.html#rstudio_server_job).

- Now, you should see a box like this:

<p align="center">
<img src=img/osc_queued.png width="700">
</p>

- Your job should start running pretty soon, and when it's ready the box should look like this: 

<p align="center">
<img src=img/osc_running.png width="700">
</p>

- Click `Connect to RStudio Server` at the bottom of the box, and an RStudio Server instance will open. You're ready to go!

</details>

### Create an RStudio Project

<details><summary>Show instructions</summary>

```{r, eval=FALSE}
basedir <- "/fs/project/PAS0471/teach/misc/2021-02_rnaseq/"

# Get your user name (by running a shell command via the `system()` function:
me <- system("echo $USER", intern = TRUE)

# Build the path to the target dir:
proj_dir <- file.path(basedir, me)

# Create the Project:
library(usethis)
create_project(path = proj_dir)
```

Now, RStudio will reload with the newly created Project open.

If you get the pop-up below, click `Don't Save` (do this whenever you get that pop-up):

<p align="center">
<img src=img/rdata-popup.png width="350">
</p>

</details>

### Copy the count data and the metadata files

Go the `Terminal` tab next to the R `Console` in the bottom left of the RStudio
window:

```{sh, eval=FALSE}
cd /fs/project/PAS0471/teach/misc/2021-02_rnaseq/$USER

cp ../master/data/meta/metadata.txt data/meta
cp ../master/results/count/gene_counts_all.txt results/count
```

### Load the necessary packages

```{r, warning=FALSE, message=FALSE}
if(! "pacman" %in% installed.packages()) install.packages("pacman")

packages <- c("DESeq2",          # Differential expression analysis
              "tidyverse",       # Misc. data manipulation and plotting
              "here",            # Managing file paths
              "pheatmap",        # Heatmap plot
              "apeglm")          # For LFC shrinkage
pacman::p_load(char = packages)

theme_set(theme_bw())  # Set ggplot theme
```

### Input and output dirs and files

```{r, eval=TRUE, echo=FALSE}
count_table_file <- "results/count/gene_counts_all.txt"
metadata_file <- "data/meta/metadata.txt"

outdir <- "results/DE/"
plotdir <- "results/DE/fig/"
```

Input files:

- **Gene counts table** -- the file exactly as it was written by *featureCounts*.
  
- **Metadata table** -- so we can group our samples and make comparisons between
  these groups.

```{r, eval=FALSE}
count_table_file <- here("results/count/gene_counts_all.txt")
metadata_file <- here("data/meta/metadata.txt")
```

Output directories -- and we create them if they don't already exist:

```{r}
outdir <- here("results/DE/")
plotdir <- here("results/DE/fig/")

if (!dir.exists(plotdir)) dir.create(plotdir, recursive = TRUE)
if (!dir.exists(plotdir)) dir.create(plotdir, recursive = TRUE)
```

### Load input data

Load the count table from *featureCounts*:

```{r, eval=TRUE}
raw_counts <- read.table(count_table_file,
                         sep = "\t", header = TRUE, skip = 1)
```

Load the metadata information:

```{r}
metadata <- read.table(metadata_file, header = TRUE)

head(metadata)
```

The `Treatment` column currently has the values `Agrobacterium_noexp`,
`Agrobacterium_myb`, and `mock`.

To shorten this a bit, we'll get rid of "Agrobacterium_":

```{r}
metadata$Treatment <- sub("Agrobacterium_", "", metadata$Treatment)

unique(metadata$Treatment)
```


<br>

----

## Prepare the data

Change the column names, which are very long:

```{r}
colnames(raw_counts)[7:8]
```

```{r}
my_regex <- ".+PonceM_(.+)_V1N.+"
colnames(raw_counts) <- sub(my_regex, "\\1", colnames(raw_counts))

colnames(raw_counts)
```

Besides the counts, there are columns with metadata for each gene:

```{r}
raw_counts[1:5, 1:8]
```

Let's remove those:

```{r}
counts <- raw_counts[, 7:ncol(raw_counts)]
rownames(counts) <- raw_counts$Geneid
```

### Replicate samples

In this table, there are two separate entries for each sample:
each library was sequenced on two lanes.
Recall that in our workflow, we had merged these FASTQ files prior to mapping,
but here we are using the table based on the full dataset produced by Matthew.

So, we will go ahead and merge these replicates now, by simply summing the counts:

```{r}
counts.t <- t(counts)
rownames(counts.t) <- names(raw_counts)[7:36]
sums <- rowsum(counts.t, group = rownames(counts.t))
counts <- t(sums)
```

(Alternatively, one could use the specialized function
`DESeq2::collapseReplicates()` for this.)

### Check sample IDs

For differential expression analysis, we will be using the popular**DESeq2**
R/Bioconductor package
([paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8),
[website](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)).

We will load both the count table and the metadata into *DESeq2*.
When doing so, *DESeq2* assumes that sample IDs in both tables match and 
are provided in the same order. Let's make sure this is indeed the case.

Sort both data frames alphabetically:

```{r}
metadata <- metadata[order(metadata$SampleID), ]
counts <- counts[, order(colnames(counts))]
```

Check if names are the same:

```{r}
metadata$SampleID

colnames(counts)

matching_names <- identical(metadata$SampleID, colnames(counts))
matching_names
if(matching_names == FALSE) stop("Sample ID in metadata and count matrix do not match!")
```

### Create the DESeq2 object

We will create the DESeq2 object using the function `DESeqDataSetFromMatrix()`,
which we will provide with three pieces of information:

- The count data with argument `countData`.
- The metadata with argument `colData`.
- The model design for the DE analysis -- argument `design`.  
  For now, we will specify `~1`, which means "no design" --
  we will change this before the actual DE analysis.

```{r}
dds_raw <- DESeqDataSetFromMatrix(countData = counts,
                                  colData = metadata,
                                  design = ~ 1)
```

<br>

----

## Explore the count data

What are number of rows and columns of the count matrix?

```{r}
dim(counts)
```

How many genes have non-zero counts?

```{r}
dim(counts[rowSums(counts) > 0, ])

```

How many genes have total counts of at least 10?

```{r}
dim(counts[rowSums(counts) >= 10, ])
```

### Histogram of gene counts

Let's plot a histogram of gene counts:

```{r}
theme_set(theme_bw())

summed_gene_counts <- data.frame(gene_count = rowSums(counts)) %>%
  rownames_to_column("gene_id")

ggplot(data = summed_gene_counts) +
  geom_histogram(aes(x = gene_count), binwidth = 10000) +
  scale_y_log10(expand = c(0, 0)) +
  scale_x_continuous(expand = c(0,0))
```

Zoom in a bit:

```{r}
ggplot(data = summed_gene_counts) +
  geom_histogram(aes(x = gene_count), binwidth = 1000) +
  scale_y_log10(expand = c(0, 0)) +
  scale_x_continuous(limits = c(0, 200000), expand = c(0,0)) +
  theme(plot.margin = margin(0.5, 0.7, 0.5, 0.5, "cm"))
```

How are counts distributed across samples?
That is, we would like a sum of counts for each column.
To get this, we use the `apply()` function, which can apply a function
(in our case `sum()`) to all columns (hence `MARGIN = 2` -- for rows, use `1`)
of our `counts` dataframe:

```{r}
apply(X = counts, MARGIN = 2, FUN = sum)
```

<br>

----

## Principal Component Analysis (PCA) 

### Run the PCA and prepare for plotting

First, we normalize the count data to have even sampling across samples
(with respect to library size) and approximately even variance:

```{r}
vsd <- varianceStabilizingTransformation(dds_raw, blind = TRUE)
```

Next, we run the PCA and retrieve the data to plot with *ggplot2*:

```{r}
pcaData <- plotPCA(vsd,
                   ntop = 500,
                   intgroup = c("AMF", "Treatment"),
                   returnData = TRUE)
```

We extract the percentage of variance explained by different principal components,
so we can later add this information to the plot:

```{r}
percentVar <- round(100 * attr(pcaData, "percentVar"))
percentVar
```

We create a plot title with the species name in italic using the
somewhat bizarre `expression()` function: 

```{r}
plot_title <- expression("PCA of " * italic(Glycine ~ max) * " transcript count data")
```

### Plot the PCA results

```{r}
ggplot(pcaData,
       aes(x = PC1, y = PC2, color = AMF, shape = Treatment)) +
  geom_point(size = 6) +
  xlab(paste0("PC1: ", percentVar[1], "% of variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% of variance")) +
  ggtitle(plot_title)
```

### Plot again -- with sample names

```{r}
library(ggrepel)

pca_plot2 <- ggplot(pcaData,
              aes(PC1, PC2, color = AMF, shape = Treatment)) +
  geom_point(size = 3) +
  geom_label_repel(aes(label = name)) +
  xlab(paste0("PC1: ", percentVar[1], "% of variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% of variance")) +
  ggtitle(plot_title)

print(pca_plot2)
```

<br>

----

## DE analysis -- full dataset

The design has two factors: `AMF` and `Treatment`.
Rather than fit a multivariate model, we can start by merging the two into a
single factor called `group`, and fit a univariate model with this factor.

```{r}
dds_raw$group <- factor(paste(dds_raw$AMF, dds_raw$Treatment, sep = "_"))
table(dds_raw$group)
```

We will set the "reference" level of the factor to be the double negative control
(empty substrate, no Agrobacteria):

```{r}
dds_raw$group <- relevel(dds_raw$group, ref = "control_mock")
dds_raw$group
```

Next, we set the analysis design:

```{r}
design(dds_raw) <- formula(~ group)
```

And finally, we perform the differential expression analysis with the `DEseq()`
function:

```{r}
dds <- DESeq(dds_raw)
```

The `DESeq()` function above performs three steps consecutively:
  
  - `estimateSizeFactors()` -- "Normalization" by library size.
    Note that *DESeq2* doesn’t actually normalize the counts,
    it uses raw counts and includes the size factors in the modeling. 
  
  - `estimateDispersions()` -- Estimate gene-wise dispersions.
  
  - `nbinomWaldTest(ddsObj)` -- Fit the negative Binomial GLM and calculate
    Wald statistics.

These functions could also be called separately,
which would be useful if you want to be able to change more defaults.

### The results table

```{r}
res <- results(dds)
res
```

By default, the results table prints statistics
**comparing the last level of the factor with the first level**:
that is, log-fold change and p-values describe
differences between these two levels specifically.
However, we can easily extract equivalent statistics for any pairwise comparison
among our factor levels, which we will see later.

For now, we will explore what each column in this table means:

- The `baseMean` column contains the mean expression level across all samples.

- The `log2FoldChange` column contains the log2-fold change of gene counts
  between the compared levels, that is, it represents the *effect size*.
  A log2-fold change of 1 indicates that the expression in the reference level
  is 2 times lower than that of the other level, a log2-fold change of 2 indicates
  a four-fold difference, a log2-fold change of 3 indicates an eight-fold difference,
  and so on. Similarly, negative log2-fold change values indicate an change
  in gene counts in the other direction.
  
- The `lfcSE` column indicates the uncertainty in terms of the standard error
  (SE) of the log2-fold change estimate.
  
- The `stat` column indicates the value for the Wald test's test statistic.

- The `pvalue` column reported the *uncorrected* p-value from the Wald test.

- Because we are testing significance for *many* genes,
  we need to correct for multiple testing.
  DESeq2 uses the Benjamini-Hochberg False Discovery Rate (FDR) correction,
  and these values are reported in the column `padj` (i.e., adjusted p-value).

A summary of this information about each column can be seen by running the
`mcols()` function:

```{r}
mcols(res)
```

### `NA` values in the results table

Some values in the results table can be set to `NA` for one of the following reasons:

- If within a row, all samples have zero counts,
  the `baseMean` column will be zero, and the log2-fold change estimates,
  p-value and adjusted p-value will all be set to `NA`.

- DESeq2 performs outlier detection using Cook's distance.
  If a gene contains a sample with an count outlier,
  both the p-value and adjusted p-value will be set to `NA`.

- DESeq2 also automatically filters low mean count genes in the sense that it
  does not include them in the multiple testing correction.
  Therefore, in such cases, the p-value will not be `NA`, but the adjusted p-value
  will be.
  
  Because we have very low power to detect differential expression for such
  low-count genes, it is beneficial to remove them prior to the multiple testing
  correction, so this correction becomes less severe for the remaining genes.

Let's see how many genes have `NA` p-values:

```{r}
# Number of genes with NA p-value:
sum(is.na(res$pvalue))

# As a proportion of the total number of genes in the test:
sum(is.na(res$pvalue)) / nrow(res)
```

And `NA` adjusted p-values:

```{r}
# Number of genes with NA p-value:
sum(is.na(res$padj))

# As a proportion of the total number of genes in the test:
sum(is.na(res$padj)) / nrow(res)
```

<br>

----

## DE analysis -- contrast two custom groups

Although not displayed in a particularly readable ways,
we can see which pairwise contrasts between different levels of the factor
are available:

```{r}
resultsNames(dds)
```

Not all pairwise contrasts between the 5 levels in our `group` factor are available:
instead, `control_mock`, which we set as the reference level, is being compared
with all 4 other levels.

Above, we looked at the results (p-values and so on) for the last of these
comparisons, which is what DESeq2 will show by default.
To see the results table for one of the other 3 comparisons,
we pass a vector to the `contrast` argument of the `results()` function with
the factor (`group`) and the two levels to be contrasted:

```{r}
my_contrast <- c("Ri_mock", "control_mock")

res <- results(dds,
               contrast = c("group", my_contrast))
```

How many adjusted p-values were less than 0.1?

```{r}
sum(res$padj < 0.1, na.rm = TRUE)
```

We'll also create an object with adjusted (shrunken) LFC estimates which will
be useful for visualization and ranking of genes by LFC.

```{r}
my_coef <- paste0("group_", paste0(my_contrast, collapse = "_vs_"))
my_coef
res_LFC <- lfcShrink(dds, coef = my_coef, type = "apeglm", lfcThreshold = 1)
```

Note that here, we had to provide the contrast ("coefficient") in the format
given by `resultsNames(dds)`, which looks confusing because it uses underscores
as a separator, and our factor levels also contain underscores.

<br>

----

## Exploring the results

### MA-plot:

For a nice overview of the results, we can plot a so-called "MA plot",
which plots, for each gene,
count differences (as LFC) between two groups on the y-axis,
and mean counts across both groups on the x-axis.

We can create an MA plot using DESeq2's `plotMA` function,
with significantly differentially expressed genes (using alpha = 0.1) in blue:

```{r}
plotMA(res, ylim = c(-5, 5))
```

To be able to customize the plot,
we'll use `returnData = TRUE` like we have done with previous plots,
and then plot the resulting dataframe with *ggplot2*:

```{r}
d <- plotMA(res, returnData = TRUE)

ggplot(d, aes(x = mean, y = lfc, color = isDE)) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_y_continuous(limits = c(-10, 10)) +
  scale_color_manual(values = c('grey50', 'blue')) +
  guides(color = FALSE) +
  labs(x = "Mean of normalized counts",
       y = "LFC")
```

We can see that lowly-expressed genes tend to deviate from an LFC of 0
(same mean expression levels in the two groups) much more than 0.
However, this is an artifact of noise overpowering signal with such low
expression values. We can also see that genes are never significantly expressed
in the same far left part of the plot due to this lack of power.

DESeq2 provides several methods to adjust LFC estimates for the low-expression
bias, one of which we used above to produce the `res_LFC` object.
Let's create another MA plot for these adjusted estimates:

```{r}
d <- plotMA(res_LFC, ylim = c(-5, 5), returnData = TRUE)

ggplot(d, aes(x = mean, y = lfc, color = isDE)) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_color_manual(values = c('grey50', 'blue')) +
  guides(color = FALSE) +
  labs(x = "Mean of normalized counts",
       y = "Shrunken LFC")
```

We can also make this plot interactive with Plotly, so we can see the identity
of each gene by hovering over the point:

```{r}
library(plotly)

d$gene <- rownames(d)

p_ma <- ggplot(d,
               aes(x = mean, y = lfc, color = isDE, text = gene)) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_color_manual(values = c('grey50', 'blue')) +
  guides(color = FALSE) +
  labs(x = "Mean of normalized counts",
       y = "Shrunken LFC")

ggplotly(p_ma, tooltip = "text")
```

### Plot specific genes

We can also create plot of expression levels in individual genes,
which will be especially interesting for genes with highly significant differential
expression.

Let's plot the top-5 most significantly differentially expressed genes:

```{r, eval=TRUE}
top5 <- row.names(res[order(res$padj), ][1:5, ])

for(fgene in top5) {
  
  d <- plotCounts(dds,
                  gene = fgene,
                  intgroup = "group",
                  returnData = TRUE)

  p <- ggplot(d, aes(x = group, y = count)) +
          geom_point(position = position_jitter(w = 0.1, h = 0)) +
          labs(title = fgene) +
          theme_bw()
  
  print(p)
}

```

```{r, eval=FALSE, echo=FALSE}
# "Solyc02g069430.4" "Solyc06g050590.3" "Solyc01g111220.3" "Solyc10g047030.3" "Solyc04g079330.2"
d <- plotCounts(dds,
                gene = "Solyc02g069430.4",
                intgroup = "group",
                returnData = TRUE)

ggplot(d, aes(x = group, y = count)) +
  geom_point(position = position_jitter(w = 0.1, h = 0)) +
  labs(title = fgene) +
  theme_bw()
```

### Heatmap

We can create heatmaps with the `pheatmap` function --
for instance, for the 20 most highly expressed genes:

```{r, eval=TRUE}
ntd <- normTransform(dds)
selection <- order(rowMeans(counts(dds, normalized = TRUE)),
                   decreasing = TRUE)[1:20]
ntd_sel <- assay(ntd)[selection, ]
df_meta <- as.data.frame(colData(dds)[, c("AMF", "Treatment")])
```

```{r, eval=TRUE}
pheatmap(ntd_sel,
         cluster_rows = FALSE,
         cluster_cols = FALSE,
         show_rownames = FALSE,
         annotation_col = df_meta)
```

### Export the results

TBA

<br>

----

## DE analysis -- with two factors

TBA

```{r, eval=FALSE, echo=FALSE}
dds_2f_raw <- dds_raw

# AMF + Treatment: Test for the effect of Treatment (the last factor),
# ...controlling for the effect of AMF (the first factor).
design(dds_2f_raw) <- formula(~ AMF + Treatment)

# Or use interaction term: ∼ AMF + Treatment + AMF:Treatment
```

```{r, eval=FALSE, echo=FALSE}
# Run DESeq with the new design:
dds_2f <- DESeq(dds_2f_raw)
```

